Systems and methods for annotating datasets

ABSTRACT

Disclosed herein are systems and methods for annotating and joining datasets. The system may include one or more processors and a memory storing instructions that, when executed by the one or more processors, cause the system to perform determining at least a first database table to be annotated, the first database table including a set of columns and rows of a dataset. In some embodiments, the system may further perform determining at least one typeclass that applies to one or more columns included in the first database table, wherein the typeclass describes values stored in the one or more columns, and annotating the one or more columns, wherein the annotated columns are associated with the typeclass.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application Ser. No. 62/598,299 filed Dec. 13, 2017, the content of which is incorporated by reference in its entirety into the present disclosure.

BACKGROUND

Technical Field

This disclosure relates to approaches for annotating datasets, and more particularly, for annotating datasets based on identified relationships between the datasets.

Description of Related Art

Databases may often include a considerable number of datasets amassed together without any foundational organization or structure. As a result, while such databases may include the necessary data to provide a wide array of information for data analysis, users may not be able to locate the necessary data needed to combine, join, and bolster a select dataset for any such meaningful analysis.

Conventional approaches may limit the analysis of data for users who have no way of knowing the types of information already stored in datasets being managed by the database. Indeed, a database may include hundreds, if not thousands, of disparate datasets. Additionally, other conventional approaches may join datasets within a database, but may join the datasets up-front, often before a user has a chance to evaluate whether the datasets are comparable. Thus, conventional approaches may make it difficult to identify the datasets that can be joined together. Conventional approaches may also limit the flexibility of users who have not decided whether they would like to join any such datasets together.

SUMMARY

Various embodiments of the present disclosure include systems, methods, and non-transitory computer readable media configured to annotate and join datasets. The system may include one or more processors and a memory storing instructions that, when executed by the one or more processors, cause the system to perform determining at least a first database table to be annotated, the first database table including a set of columns and rows of a dataset. Some embodiments may further include determining at least one typeclass that applies to one or more columns included in the first database table, wherein the typeclass describes values stored in the one or more columns, and annotating the one or more columns, wherein the annotated columns are associated with the typeclass.

In some embodiments, the typeclass describes metadata information for the one or more columns.

In some embodiments, the metadata information describes a data type corresponding to data values in the one or more columns.

In some embodiments, the metadata information describes a data format corresponding to data values in the one or more columns.

In some embodiments, the typeclass is associated with one or more data validations.

In some embodiments, the data validations are automatically applied to validate values stored in the one or more columns.

In some embodiments, the systems, methods, and non-transitory computer readable media are configured to join the first database table with at least one second database table based at least in part on the one or more annotated columns.

BRIEF DESCRIPTION OF THE DRAWINGS

Features of various embodiments of the present technology are set forth with particularity in the appended claims. A better understanding of the features and advantages of the technology will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the technology are utilized, and the accompanying drawings of which:

FIG. 1 is a diagram of an example of a dataset relationship management environment, per some embodiments.

FIG. 2 is a diagram of an example of a method, per some embodiments.

FIG. 3 is a diagram of an interface configured to display datasets, per some embodiments.

FIG. 4 is a diagram of an interface configured to provide a typeclass selection for a particular dataset, per some embodiments.

FIG. 5 is a diagram of an interface configured to display a join operation of a first dataset and a second dataset, per some embodiments.

FIG. 6 depicts a block diagram of an example of a computer system upon which any of the embodiments described herein may be implemented.

DETAILED DESCRIPTION

A claimed solution rooted in computer technology overcomes problems specifically arising in the realm of computer technology. In various embodiments, an interface for curating and managing data is provided. The interface can allow users to associate columns of database tables with various typeclasses. In some embodiments, a typeclass can be used to describe metadata information for a given column. In some embodiments, a typeclass can be associated with one or more validations that can be used to evaluate data corresponding to a given column. In some embodiments, datasets can automatically be enriched based on typeclasses. For example, typeclasses can be used to join data stored in disparate database tables. In some embodiments, typeclasses can automatically be suggested for database tables. For example, a typeclass can automatically be suggested for a given column based on a data type associated with the column or a pattern recognized in data values stored in the column.

FIG. 1 is a diagram of an example of a dataset relationship management environment 100, per some embodiments. The dataset relationship management environment 100 shown in FIG. 1 includes one or more database(s) 102 (shown as a first database 102(1) through an Nth database 102(N) (where “N” may represent an arbitrary integer)) and a dataset relationship management system 104. The database(s) 102 and the dataset relationship management system 104 may be coupled to one another through one or more computer networks (e.g., LAN, WAN, or the like) or other transmission media. The computer networks and/or transmission media may provide communication between the database(s) 102 and the dataset relationship management system 104 and/or between components in those systems. Communication networks and transmission media are discussed further herein.

The database(s) 102 may include one or more databases configured to store data. The database(s) 102 may include tables, comma-separated values (CSV) files, structured databases (e.g., those structured in Structured Query Language (SQL)), or other applicable known or convenient organizational formats. The database(s) 102 may support queries and/or other requests for data from other modules, such as the dataset relationship management system 104. In some embodiments, the database(s) 102 may provide stored data in response to the queries/requests. The databases may include “datasets,” which as used herein, may refer to collections of data within a database. A dataset may include all data in a database that follows a specific format or structure. In some embodiments, a dataset corresponds to one or more database tables that include one or more columns and one or more rows. In such embodiments, database tables can be populated with data values in corresponding rows and columns.

The dataset relationship management system 104 may include modules configured to annotate datasets (e.g., database table columns), determine relationships between such datasets, and enrich such datasets with other related datasets. The dataset relationship management system 104 includes an interface engine 106, a dataset identification engine 108, a typeclass engine 110, an annotation engine 112, an enrichment engine 114, and a typeclass suggestion engine 116.

The interface engine 106 can be configured to provide an interface (e.g., graphical user interface, application programming interface) for curating and managing datasets. The interface may be provided through a display screen of a computing device by a software application (e.g., web browser, standalone app, etc.) running on the computing device. The computing device can include one or more processors and memory. For example, in some embodiments, the interface can provide options for annotating datasets (e.g., database table columns) with various typeclasses. In some embodiments, a typeclass can be used to describe metadata information for a given dataset. In some embodiments, a typeclass can be associated with one or more validations that can be used to evaluate data corresponding to a given dataset. In some embodiments, the interface can provide options for enriching datasets. For example, in some embodiments, the interface can provide options for enriching data in a first dataset (e.g., database table) with data from a second dataset (e.g., database table). In some embodiments, the interface can provide typeclass suggestions for annotating datasets (e.g., database table columns). Many variations are possible.

The dataset identification engine 108 may be configured to identify datasets of interest in the database(s) 102. The dataset identification engine 108 may be configured to execute specific queries to identify specific or related datasets from the database(s) 102. In some embodiments, the dataset identification engine 108 may identify datasets located in tables, and thus identify the specific columns and/or rows associated with a selected database table from the database(s) 102. In some embodiments, the dataset identification engine 108 is configured to identify and search tables and their columns and rows from the database(s) 102. In various embodiments, the dataset identification engine 108 is configured to identify datasets that match date and/or time ranges, are responsive to keyword searches, fall within subject areas of interest, are responsive to structured and/or unstructured queries, and/or the like. In some implementations, the dataset identification engine 108 receives instructions from a user to identify the datasets of interest. The dataset identification engine 108 may also receive instructions from automated agents, such as automated processes executed on the dataset relationship management system 104, to identify the datasets of interest. In one example, the automated agents may look to see if data correlates to another dataset based on an overlap of data type, similar data structure, similar subject area of interest, and/or the like. The dataset identification engine 108 may provide the identified datasets of interest to one or more other engines for processing (or analysis).

The typeclass engine 110 may be configured to store typeclasses and assign typeclasses to particular datasets (e.g., database table columns) in the database(s) 102. In some embodiments, a typeclass can be used to describe metadata information for a given column. For example, a typeclass can be associated with a pre-determined data format or type. For example, a “social security number” typeclass may be associated with a pre-determined data format that corresponds to a 9 digit number that may be formatted as three digits, followed by a hyphen, another two digits, followed by a hyphen, and finally four more digits (e.g., “###-##-####”). In another example, a “user identifier” typeclass may be associated with a pre-determined type that identifies data values as being user identifiers in a given database table. In some embodiments, columns in a database table can be annotated (or labeled) using one or more typeclasses to identify the type of data values that are included in a given column. For example, a column that includes data values corresponding to social security numbers can be annotated using the “social security number” typeclass. In another example, a column that includes data values corresponding to user identifiers can be annotated with the “user identifier” typeclass. Many variations are possible. For example, a column may be associated with the “email address” typeclass if data values in the column satisfy an email address format associated with the typeclass (e.g., “username” followed by “@” followed by a domain name and a top-level domain). In another example, a column annotated with a “phone number” typeclass may include data values that correspond to a three digit number followed by a hyphen, another three digit number followed by a hyphen, and finally a four digit number (e.g., “###-###-####”). In various embodiments, a data format associated with a typeclass can be defined as a pattern using regular expressions. Again, many variations are possible.
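As a minimal sketch of how a data format associated with a typeclass might be defined as a regular expression pattern, consider the following Python fragment. The `Typeclass` class, the `matches` helper, and the two specific patterns are hypothetical and are offered only for illustration; they are not taken from any particular embodiment.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Typeclass:
    """Hypothetical container pairing a typeclass name with a format pattern."""
    name: str
    pattern: re.Pattern


# Illustrative patterns for the "###-##-####" and "###-###-####" formats.
SSN = Typeclass("social security number", re.compile(r"\d{3}-\d{2}-\d{4}"))
PHONE = Typeclass("phone number", re.compile(r"\d{3}-\d{3}-\d{4}"))


def matches(typeclass: Typeclass, value: str) -> bool:
    """Return True if the entire value satisfies the typeclass pattern."""
    return typeclass.pattern.fullmatch(value) is not None
```

Here `fullmatch` is used so that a value only satisfies a format when the whole string fits the pattern, rather than merely containing a matching substring.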

In various embodiments, the typeclass engine 110 may be configured to ensure data values satisfy various validation rules. For example, in some embodiments, a typeclass can be associated with one or more validations (or validation rules) that can be used to evaluate data corresponding to a given dataset (e.g., database table column). Such data validation rules can vary depending on the typeclass. For example, a “username” typeclass may be associated with a first data validation rule that ensures data values are alphanumeric and a second validation rule that ensures data values are all lowercase. In this example, data values in a table column annotated with the “username” typeclass can be validated using both the first data validation rule and the second data validation rule. Additionally, in some embodiments, the typeclass engine 110 may further be configured such that any data values from two or more columns are only combined (or joined) if the data values are associated with the same typeclass.
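The “username” example above could be sketched roughly as follows. The `VALIDATIONS` registry and the function names are assumptions made for illustration, and the rules rely on Python's built-in string methods, which also accept non-ASCII letters and digits.

```python
from typing import Callable, Dict, List

ValidationRule = Callable[[str], bool]


def is_alphanumeric(value: str) -> bool:
    # First rule: only letters and digits (str.isalnum is Unicode-aware).
    return value.isalnum()


def is_lowercase(value: str) -> bool:
    # Second rule: the value contains no uppercase characters.
    return value == value.lower()


# Hypothetical registry mapping a typeclass name to its validation rules.
VALIDATIONS: Dict[str, List[ValidationRule]] = {
    "username": [is_alphanumeric, is_lowercase],
}


def invalid_values(typeclass_name: str, values: List[str]) -> List[str]:
    """Return the values that fail at least one rule of the typeclass."""
    rules = VALIDATIONS.get(typeclass_name, [])
    return [v for v in values if not all(rule(v) for rule in rules)]
```

In this sketch a typeclass with no registered rules reports no invalid values, which is one possible design choice among many.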

The annotation engine 112 may be configured to annotate datasets (e.g., columns in a database table) with one or more typeclasses. In various embodiments, the annotation engine 112 can be instructed to annotate a given database table column with a given typeclass either manually or using automated approaches. For example, in some embodiments, a user can manually identify datasets to be annotated using one or more selected typeclasses. In some embodiments, the annotation engine 112 can automatically identify datasets to be annotated using one or more suggested typeclasses. For example, in some embodiments, the annotation engine 112 can evaluate data values in a given database table column to infer, or determine, whether the data values correspond to particular data types and/or data formats. For example, the annotation engine 112 can determine that the evaluated data values correspond to email addresses. In some embodiments, this determination can be made in view of keywords (e.g., text) identified in the evaluated data values. For example, the annotation engine 112 can determine that portions of the evaluated data values include domain names of popular email services. In this example, the annotation engine 112 can determine that the data values correspond to email addresses. In another example, the annotation engine 112 can determine that the evaluated data values correspond to email addresses based on pre-defined data formats (or patterns) reflected in the data values. For example, the data values may tend to follow the same, or a similar, pattern such as including an at sign (i.e., “@”), followed by a domain name, and then a top-level domain (e.g., “.com”, “.org”, etc.). Many variations are possible. Next, the annotation engine 112 can determine any typeclasses that are applicable to those determined particular data types and/or data formats. For example, the annotation engine 112 can determine whether the data types and/or data formats match data types and/or data formats associated with any typeclasses. When a match is determined, the annotation engine 112 can automatically suggest a corresponding typeclass associated with the matching data types and/or data formats to be used for annotating the database table column. In the foregoing example, the database table column from which the data values were evaluated can be labeled (or annotated) with a typeclass that corresponds to email addresses. Once labeled, the data values included in the labeled database table column can be subsequently identified based on the column's typeclass without having to re-evaluate the data values.
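One way the pattern-based inference described above might be approximated is shown below. The `PATTERNS` table, the `suggest_typeclass` function, and the 0.9 default for `min_fraction` are all assumptions chosen for illustration; the email pattern is deliberately simplistic and is not a complete address validator.

```python
import re

# Hypothetical pattern table; each typeclass name maps to a format pattern.
PATTERNS = {
    "email address": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
    "phone number": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def suggest_typeclass(values, min_fraction=0.9):
    """Suggest the typeclass whose pattern most of the column values satisfy.

    Returns None when the column is empty or when no pattern is satisfied by
    at least min_fraction of the non-empty values.
    """
    values = [v for v in values if v]
    if not values:
        return None
    best_name, best_fraction = None, 0.0
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        fraction = hits / len(values)
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name if best_fraction >= min_fraction else None
```

Requiring only a fraction of the values to match, rather than all of them, tolerates a few malformed entries in an otherwise consistent column.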

The enrichment engine 114 may be configured to identify relationships between datasets identified, for example, by the dataset identification engine 108, typeclass engine 110, and/or the annotation engine 112. In some embodiments, a first dataset and a second dataset may be determined to be related if the two datasets share at least one typeclass. For example, a first database table may include a column associated with a “username” typeclass. Similarly, a second database table may also include a column associated with the “username” typeclass. In this example, these columns and tables can be determined to be related based on the shared “username” typeclass. As a result, data in the first database table can be enriched using data from the second database table. Many variations are possible.

In some embodiments, the enrichment engine 114 may be used to determine which data values from datasets can be joined or combined. For example, a column (e.g., a column that includes data corresponding to vehicle models) in a first database table corresponding to vehicle information can be annotated with a “vehicle model” typeclass. Similarly, a column (e.g., a column that also includes data corresponding to vehicle models) in a second database table can also be annotated with the “vehicle model” typeclass. In this example, the enrichment engine 114 can determine that both the first and second database tables can be joined at the columns annotated with the “vehicle model” typeclass. In some embodiments, the enrichment engine 114 automatically generates a view of the joined datasets. In some embodiments, the enrichment engine 114 provides the datasets as suggested joins. Many variations are possible.
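The “vehicle model” example might be sketched as follows, with tables represented as lists of dictionaries and annotations as a mapping from column name to typeclass name. Both representations, the function names, and the choice of an inner join are assumptions for illustration only.

```python
def shared_typeclass_columns(annotations_a, annotations_b):
    """Return (column_a, column_b, typeclass) triples for shared typeclasses."""
    pairs = []
    for col_a, tc_a in annotations_a.items():
        for col_b, tc_b in annotations_b.items():
            if tc_a == tc_b:
                pairs.append((col_a, col_b, tc_a))
    return pairs


def join_on(rows_a, rows_b, col_a, col_b):
    """Inner-join two lists of row dicts where rows_a[col_a] == rows_b[col_b]."""
    # Index the second table by its join column for constant-time lookups.
    index = {}
    for row in rows_b:
        index.setdefault(row[col_b], []).append(row)
    joined = []
    for row in rows_a:
        for match in index.get(row[col_a], []):
            merged = dict(row)
            # Drop the duplicate join column; on any other name clash the
            # value from the second table wins in this simple sketch.
            merged.update({k: v for k, v in match.items() if k != col_b})
            joined.append(merged)
    return joined
```

A candidate join can then be found with `shared_typeclass_columns` and either applied directly with `join_on` or presented to the user as a suggestion.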

In some embodiments, the enrichment engine 114 may analyze datasets to determine similar or related data qualities, data characteristics, and/or data patterns. In such embodiments, the enrichment engine 114 can automatically associate datasets with appropriate typeclasses based on the analysis. For example, a database table column may include values that correspond to a particular data format. In this example, this particular data format may be associated with a given typeclass. As a result, the enrichment engine 114 can automatically associate the database table column with the typeclass. For example, a database table column that includes values corresponding to a social security number format can automatically be annotated with a “social security number” typeclass that is associated with the social security number format. Many variations are possible. Additionally, in some embodiments, typeclasses assigned to a first database column can automatically be assigned (or propagated) to a second database column based on an overlap between data values in the first database column and data values in the second database column. For example, an overlap can be contextually and/or statistically determined based on some relationship between the two sets of data values. Such relationships may be determined based on the data values themselves and/or on data formats determined for those data values, as described above. Many variations are possible. For example, data values (e.g., A152-MAX, A152-JUMBO, B550-TWIN, etc.) in a first database table column may be annotated using a “product model” typeclass. In this example, the enrichment engine 114 can evaluate data values (e.g., A220-WIDE, A680-JET, C110-PRO, etc.) in a second database table column to determine a likelihood of correspondence between data values in the first database table column and data values in the second database table column. If the likelihood of correspondence between data values in the first database table column and data values in the second database table column satisfies a threshold (e.g., at least 90 percent), then the enrichment engine 114 can automatically label the second database table column using the “product model” typeclass. When determining a likelihood of correspondence between a first set of data values and a second set of data values, the enrichment engine 114 may consider many factors. For example, in some embodiments, the enrichment engine 114 may determine whether a data type (e.g., numerical, Boolean, string) of the first set of data values matches a data type of the second set of data values. In some embodiments, the enrichment engine 114 may determine a likelihood of correspondence based on a threshold match between the first set of data values and the second set of data values. In some embodiments, the enrichment engine 114 may compute an aggregated edit distance between the first set of data values and the second set of data values. In such embodiments, a likelihood of correspondence between the two sets of data values increases as the aggregated edit distance decreases.
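The disclosure does not fix how the edit distance is aggregated, so the following is only one possible sketch. It assumes Levenshtein distance, normalizes each distance by the longer string length, takes for each value in the second set the closest value in the first set, and averages the results; the function names and this aggregation scheme are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def likelihood_of_correspondence(first, second) -> float:
    """Score in [0, 1] that rises as the aggregated edit distance falls."""
    if not first or not second:
        return 0.0
    total = 0.0
    for value in second:
        # Normalized distance to the closest value in the first set.
        total += min(edit_distance(value, other) / max(len(value), len(other), 1)
                     for other in first)
    return 1.0 - total / len(second)
```

A column could then be labeled automatically when the returned score satisfies a chosen threshold. Other factors mentioned above, such as matching data types, would be combined with this score in a fuller implementation.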

In some embodiments, the typeclass suggestion engine 116 can automatically suggest typeclasses for annotating various datasets. For example, in some embodiments, a typeclass can automatically be suggested for a given column of a database table based on a data type associated with the column. For example, a database table column can include data values corresponding to timestamps. In this example, the typeclass suggestion engine 116 can evaluate the data values to determine that a typeclass corresponding to “timestamps” can be used to annotate the database table column. In some embodiments, a typeclass can automatically be suggested for a given column of a database table based on a pattern (e.g., data format) recognized in data values stored in the column. For example, a database table column can include data values corresponding to mailing addresses. In this example, the typeclass suggestion engine 116 can evaluate the data values to determine that a typeclass corresponding to “physical address” can be used to annotate the database table column. Many variations are possible. In some embodiments, the typeclass suggestion engine 116 can utilize pre-defined regular expressions to evaluate data values. In some embodiments, the typeclass suggestion engine 116 can utilize trained machine learning models to evaluate data values. For example, a machine learning model can be trained to determine that a set of data values correspond to a particular data type. Similarly, a machine learning model can be trained to determine that a set of data values correspond to a particular data format. In some embodiments, typeclasses for a first database table column can be suggested based on an evaluation of data values included in that column. In such embodiments, if the evaluated data values correspond to (or overlap with) data values included in a second database table column with a threshold likelihood, then any typeclasses assigned to the second database table column can be provided as suggestions for annotating the first database table column. In some embodiments, correspondence between data values can be determined based on a likelihood of correspondence, as described above. In some embodiments, typeclasses for a first database table column can be suggested based on historical join data describing join operations that were performed using the first database table column. For example, if the first database table column was joined with a second database table column at least a threshold amount of times (e.g., number, percentage) by various users, then any typeclasses associated with the second database table column can be provided as suggestions for annotating the first database table column. In some embodiments, database table columns that were previously joined with the first database table column can be ranked. In some embodiments, the database table columns can be ranked based on a respective count of join operations that were performed between the database table column and the first database table column. In some embodiments, typeclasses for database table columns that satisfy a threshold rank can be provided as suggestions for annotating the first database table column. In some embodiments, typeclasses for database table columns that were most recently joined with the first database table column can be provided as suggestions for annotating the first database table column. Many variations are possible.
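The join-history approach could be sketched as below. The representation of historical join data as a list of column pairs, the `suggest_from_join_history` function, and the default values of `min_joins` and `top_n` are hypothetical and are chosen only to illustrate ranking by join count.

```python
from collections import Counter


def suggest_from_join_history(target, join_history, annotations,
                              min_joins=2, top_n=3):
    """Suggest typeclasses for `target` from columns it was often joined with.

    join_history: iterable of (column, column) pairs, one per past join.
    annotations:  mapping of column name to a list of typeclass names.
    """
    counts = Counter()
    for left, right in join_history:
        if left == right:
            continue  # ignore self-joins
        if left == target:
            counts[right] += 1
        elif right == target:
            counts[left] += 1
    suggestions = []
    # Rank partner columns by join count and keep those above the threshold.
    for column, count in counts.most_common(top_n):
        if count < min_joins:
            continue
        for typeclass in annotations.get(column, []):
            if typeclass not in suggestions:
                suggestions.append(typeclass)
    return suggestions
```

Here `top_n` plays the role of a threshold rank and `min_joins` the role of a threshold number of joins; a variant based on recency would instead order the history by time.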

FIG. 2 illustrates an example method 200, per some embodiments. The operations of method 200 presented below are intended to be illustrative. In some implementations, method 200 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of method 200 are illustrated in FIG. 2 and described below is not intended to be limiting.

In some implementations, method 200 may be implemented in one or more processing devices. The one or more processing devices may include one or more devices executing some or all of the operations of method 200 in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of method 200.

At an operation 202, at least a first database table to be annotated is determined, the first database table including a set of columns and rows of a dataset. Operation 202 may be performed by one or more physical processors executing one or more engines as described above, in accordance with one or more implementations.

At an operation 204, at least one typeclass that applies to one or more columns included in the first database table is determined, wherein the typeclass describes values stored in the one or more columns. Operation 204 may be performed by one or more physical processors executing one or more engines as described above, in accordance with one or more implementations.

At an operation 206, the one or more columns are annotated, wherein the annotated columns are associated with the typeclass. Operation 206 may be performed by one or more physical processors executing one or more engines as described above, in accordance with one or more implementations.

FIG. 3 is a diagram of an interface 300 configured to display datasets, per some embodiments. The graphical user interface 300 shows a database table 302 with columns (e.g., columns 305 a, 305 b, 305 c, and 305 d). Additionally, the table may include any number of columns 305 n.

Any of the columns (e.g., columns 305 a, 305 b, 305 c, 305 d) may be associated with typeclasses. For example, the column 305 b corresponding to “employee number” may be associated with an “employee identifier” typeclass. As mentioned, a typeclass can reference a particular data type and/or particular data format. In some embodiments, a typeclass may be associated with one or more validation rules. These validation rules can be applied to data values 304 of columns that have been annotated (or labeled) with the typeclass.

More detail regarding typeclasses is depicted in FIG. 4. FIG. 4 is an interface 400 for annotating a dataset (e.g., database table column) with a particular typeclass. For example, a user may have selected the column 305 d corresponding to usernames in FIG. 3 to annotate the column 305 d with a typeclass. After selecting the column 305 d, the user may be directed to the interface 400. The interface 400 provides the user with options for annotating (or labeling) a dataset 402 (e.g., data values 304 corresponding to the column 305 d) with a typeclass. In this example, the user has the option to select from among various typeclasses using a drop-down menu 404. When a typeclass is selected, the interface 400 can provide details 406 about the selected typeclass. In some embodiments, the interface 400 can provide information 408 about validations associated with the selected typeclass. As mentioned, these validations can be applied to the dataset 402 (e.g., data values 304 corresponding to the column 305 d). Many variations are possible.

FIG. 5 is a diagram of an interface 500 displaying a join operation of a first dataset and a second dataset, per some embodiments. The join operation includes a graphical depiction of proposed join operations between datasets. In this example, the join operation involves a first database table 505 which includes at least one column 510 (“return id”) associated with a “return identifier” typeclass. In some embodiments, the first database table 505 can be enriched with data from other database tables. For example, in some embodiments, the first database table 505 can be enriched with data from other database tables that also include a column associated with the “return identifier” typeclass. In such embodiments, the join operation can join the first database table 505 with another database table based on the columns associated with the “return identifier” typeclass. Many variations are possible.

In some embodiments, the interface 500 can provide suggestions 515 for enriching the database table 505. For example, the suggestions 515 can include other tables 520 that may be joined with the database table 505 to enrich data in the database table 505. The user can select a suggested table 520 to perform a join operation with the database table 505. Once joined, the interface 500 can provide a view of the first database table 505 joined with the selected table, for example, at the column associated with the “return identifier” typeclass.

Hardware Implementation

FIG. 6 depicts a block diagram of an example of a computer system 600 upon which any of the embodiments described herein may be implemented. The computer system 600 includes a bus 602 or other communication mechanism for communicating information, and one or more hardware processors 604 coupled with bus 602 for processing information. Hardware processor(s) 604 may be, for example, one or more general purpose microprocessors.

The computer system 600 also includes a main memory 606, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 602 for storing information and instructions to be executed by processor 604. Main memory 606 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 604. Such instructions, when stored in storage media accessible to processor 604, render computer system 600 into a special-purpose machine that is customized to perform the operations specified in the instructions.

The computer system 600 further includes a read only memory (ROM) 608 or other static storage device coupled to bus 602 for storing static information and instructions for processor 604. A storage device 610, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 602 for storing information and instructions.

The computer system 600 may be coupled via bus 602 to a display 612, such as a cathode ray tube (CRT) or LCD display (or touch screen), for displaying information to a computer user. An input device 614, including alphanumeric and other keys, is coupled to bus 602 for communicating information and command selections to processor 604. Another type of user input device is cursor control 616, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 604 and for controlling cursor movement on display 612. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.

The computing system 600 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

In general, the word “module,” as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software module may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software modules may be callable from other modules or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware modules may be composed of connected logic units, such as gates and flip-flops, and/or may be composed of programmable units, such as programmable gate arrays or processors. The modules or computing device functionality described herein are preferably implemented as software modules, but may be represented in hardware or firmware. Generally, the modules described herein refer to logical modules that may be combined with other modules or divided into sub-modules despite their physical organization or storage.

The computer system 600 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 600 to be a special-purpose machine. Per one embodiment, the techniques herein are performed by computer system 600 in response to processor(s) 604 executing one or more sequences of one or more instructions contained in main memory 606. Such instructions may be read into main memory 606 from another storage medium, such as storage device 610. Execution of the sequences of instructions contained in main memory 606 causes processor(s) 604 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

The term “non-transitory media,” and similar terms, as used herein refer to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may include non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 610. Volatile media includes dynamic memory, such as main memory 606. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.

Non-transitory media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between non-transitory media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 602. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 604 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 600 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 602. Bus 602 carries the data to main memory 606, from which processor 604 retrieves and executes the instructions. The instructions received by main memory 606 may optionally be stored on storage device 610 either before or after execution by processor 604.

The computer system 600 also includes a communication interface 618 coupled to bus 602. Communication interface 618 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. For example, communication interface 618 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 618 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or WAN component to communicate with a WAN). Wireless links may also be implemented. In any such embodiment, communication interface 618 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

A network link typically provides data communication through one or more networks to other data devices. For example, a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP). The ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet”. Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link and through communication interface 618, which carry the digital data to and from computer system 600, are example forms of transmission media.

The computer system 600 can send messages and receive data, including program code, through the network(s), network link and communication interface 618. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 618.

The received code may be executed by processor 604 as it is received, and/or stored in storage device 610, or other non-volatile storage for later execution.

Engines, Components, and Logic

Certain embodiments are described herein as including logic or a number of components, engines, or mechanisms. Engines may constitute either software engines (e.g., code embodied on a machine-readable medium) or hardware engines. A “hardware engine” is a tangible unit capable of performing certain operations and may be configured or arranged in a certain physical manner. In various example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware engines of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware engine that operates to perform certain operations as described herein.

In some embodiments, a hardware engine may be implemented mechanically, electronically, or any suitable combination thereof. For example, a hardware engine may include dedicated circuitry or logic that is permanently configured to perform certain operations. For example, a hardware engine may be a special-purpose processor, such as a Field-Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). A hardware engine may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware engine may include software executed by a general-purpose processor or other programmable processor. Once configured by such software, hardware engines become specific machines (or specific components of a machine) uniquely tailored to perform the configured functions and are no longer general-purpose processors. It will be appreciated that the decision to implement a hardware engine mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.

Accordingly, the phrase “hardware engine” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. As used herein, “hardware-implemented engine” refers to a hardware engine. Considering embodiments in which hardware engines are temporarily configured (e.g., programmed), each of the hardware engines need not be configured or instantiated at any one instance in time. For example, where a hardware engine includes a general-purpose processor configured by software to become a special-purpose processor, the general-purpose processor may be configured as respectively different special-purpose processors (e.g., comprising different hardware engines) at different times. Software accordingly configures a particular processor or processors, for example, to constitute a particular hardware engine at one instance of time and to constitute a different hardware engine at a different instance of time.

Hardware engines can provide information to, and receive information from, other hardware engines. Accordingly, the described hardware engines may be regarded as being communicatively coupled. Where multiple hardware engines exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) between or among two or more of the hardware engines. In embodiments in which multiple hardware engines are configured or instantiated at different times, communications between such hardware engines may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware engines have access. For example, one hardware engine may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware engine may then, at a later time, access the memory device to retrieve and process the stored output. Hardware engines may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).
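
By way of non-limiting illustration, the store-and-retrieve form of communication between engines configured at different times may be sketched in Python as follows. The names `shared_memory`, `first_engine`, and `second_engine` are hypothetical and do not appear in the disclosure; a plain dictionary stands in for the memory device to which both engines are assumed to have access.

```python
# Hypothetical sketch: two "engines" configured at different times communicate
# through a shared memory structure rather than by direct signal transmission.
from typing import Dict, List

# Stand-in for a memory device that both engines can access (an assumption).
shared_memory: Dict[str, List[int]] = {}


def first_engine(values: List[int]) -> None:
    """Perform an operation and store its output in the shared memory."""
    shared_memory["squared"] = [v * v for v in values]


def second_engine() -> int:
    """At a later time, retrieve the stored output and process it further."""
    stored = shared_memory["squared"]
    return sum(stored)


first_engine([1, 2, 3])  # runs first and then finishes
total = second_engine()  # runs later, reading what the first engine stored
```

In this sketch the two functions never call each other; the only coupling between them is the key under which the intermediate output is stored.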

The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented engines that operate to perform one or more operations or functions described herein. As used herein, “processor-implemented engine” refers to a hardware engine implemented using one or more processors.

Similarly, the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented engines. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an Application Program Interface (API)).

The performance of certain of the operations may be distributed among the processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the processors or processor-implemented engines may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the processors or processor-implemented engines may be distributed across a number of geographic locations.

Language

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

Although an overview of the subject matter has been described with reference to specific example embodiments, various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present disclosure. Such embodiments of the subject matter may be referred to herein, individually or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single disclosure or concept if more than one is, in fact, disclosed.

The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

It will be appreciated that an “engine,” “system,” “datastore,” and/or “database” may include software, hardware, firmware, and/or circuitry. In one example, one or more software programs comprising instructions capable of being executed by a processor may perform one or more of the functions of the engines, datastores, databases, or systems described herein. In another example, circuitry may perform the same or similar functions. Alternative embodiments may include more, fewer, or functionally equivalent engines, systems, datastores, or databases, and still be within the scope of present embodiments. For example, the functionality of the various systems, engines, datastores, and/or databases may be combined or divided differently.

The datastores described herein may be any suitable structure (e.g., an active database, a relational database, a self-referential database, a table, a matrix, an array, a flat file, a document-oriented storage system, a non-relational NoSQL system, and the like), and may be cloud-based or otherwise.

As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code modules executed by one or more computer systems or computer processors comprising computer hardware. The processes and algorithms may be implemented partially or wholly in application-specific circuitry.
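
As one hedged illustration of such a code module, the following Python sketch annotates the columns of a table with a typeclass and then identifies columns of two tables that could be joined on the basis of those annotations. It is a minimal sketch under stated assumptions, not the disclosed implementation: the names `Typeclass`, `Table`, `annotate`, `joinable_columns`, and the `PHONE` typeclass are hypothetical, tables are modeled as in-memory dictionaries of string values, and a typeclass is assumed to apply to a column when every value in the column passes the validation associated with that typeclass.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Typeclass:
    """Hypothetical typeclass: describes the values stored in a column."""
    name: str
    data_type: str
    validate: Callable[[str], bool]  # data validation tied to the typeclass


@dataclass
class Table:
    """Hypothetical in-memory stand-in for a database table."""
    name: str
    columns: Dict[str, List[str]]
    annotations: Dict[str, Typeclass] = field(default_factory=dict)


# Example typeclass; the NNN-NNNN format is an assumption for illustration.
PHONE = Typeclass(
    name="phone_number",
    data_type="string",
    validate=lambda v: re.fullmatch(r"\d{3}-\d{4}", v) is not None,
)


def annotate(table: Table, typeclasses: List[Typeclass]) -> None:
    """Associate each column with the first typeclass whose validation
    accepts every value stored in that column (assumed selection rule)."""
    for column, values in table.columns.items():
        for tc in typeclasses:
            if values and all(tc.validate(v) for v in values):
                table.annotations[column] = tc
                break


def joinable_columns(left: Table, right: Table) -> List[Tuple[str, str]]:
    """Return pairs of columns that are annotated with the same typeclass."""
    return [
        (lc, rc)
        for lc, ltc in left.annotations.items()
        for rc, rtc in right.annotations.items()
        if ltc.name == rtc.name
    ]


customers = Table("customers", {"name": ["Ann", "Bob"],
                                "phone": ["555-0100", "555-0101"]})
calls = Table("calls", {"caller": ["555-0100"], "duration": ["42"]})
annotate(customers, [PHONE])
annotate(calls, [PHONE])
pairs = joinable_columns(customers, calls)
```

In this sketch the join itself is deferred: the annotations only record which columns are comparable, so a user may evaluate the candidate pairs before deciding whether to join the tables.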

The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure. In addition, certain method or process blocks may be omitted in some embodiments. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed example embodiments.
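
As a non-limiting illustration of blocks being performed in serial, in parallel, or combined into a single block, the following Python sketch runs the same two hypothetical blocks in each of those three ways. The functions `block_a`, `block_b`, and `combined_block` are placeholders introduced only for this example, and a thread pool is merely one assumed way of performing blocks in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def block_a(x: int) -> int:
    """Hypothetical first block of a process."""
    return x + 1


def block_b(x: int) -> int:
    """Hypothetical second block of a process."""
    return x * 2


def combined_block(x: int) -> int:
    """The same two blocks combined into a single block."""
    return (x + 1) * 2


inputs: List[int] = [1, 2, 3]

# Serial: each input passes through block_a and then block_b, one at a time.
serial = [block_b(block_a(x)) for x in inputs]

# Parallel: the same blocks are applied to each input on worker threads.
# Executor.map returns results in input order regardless of completion order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(lambda x: block_b(block_a(x)), inputs))

# Combined: a single block produces the same results.
combined = [combined_block(x) for x in inputs]
```

All three arrangements produce the same results here because the hypothetical blocks have no side effects and each input is processed independently.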

Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.

Any process descriptions, elements, or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. Alternate embodiments are included within the scope of the embodiments described herein in which elements or functions may be deleted, executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.

It should be emphasized that many variations and modifications may be made to the above-described embodiments, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure. The foregoing description details certain embodiments of the invention. It will be appreciated, however, that no matter how detailed the foregoing appears in text, the invention can be practiced in many ways. As is also stated above, it should be noted that the use of particular terminology when describing certain features or aspects of the invention should not be taken to imply that the terminology is being re-defined herein to be restricted to including any specific characteristics of the features or aspects of the invention with which that terminology is associated. The scope of the invention should therefore be construed in accordance with the appended claims and any equivalents thereof.

1. A system comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, cause the system to perform: determining at least a first database table to be annotated, the first database table including a set of columns and rows of a dataset; determining at least one typeclass that applies to one or more columns included in the first database table, wherein the typeclass describes values stored in the one or more columns; and annotating the one or more columns, wherein the annotated columns are associated with the typeclass.
2. The system of claim 1, wherein the typeclass describes metadata information for the one or more columns.
3. The system of claim 2, wherein the metadata information describes a data type corresponding to data values in the one or more columns.
4. The system of claim 2, wherein the metadata information describes a data format corresponding to data values in the one or more columns.
5. The system of claim 1, wherein the typeclass is associated with one or more data validations.
6. The system of claim 5, wherein the data validations are automatically applied to validate values stored in the one or more columns.
7. The system of claim 1, wherein the instructions further cause the system to perform: joining the first database table with at least one second database table based at least in part on the one or more annotated columns.
8. A method being implemented by a computing system including one or more physical processors and storage media storing machine-readable instructions, the method comprising: determining at least a first database table to be annotated, the first database table including a set of columns and rows of a dataset; determining at least one typeclass that applies to one or more columns included in the first database table, wherein the typeclass describes values stored in the one or more columns; and annotating the one or more columns, wherein the annotated columns are associated with the typeclass.
9. The method of claim 8, wherein the typeclass describes metadata information for the one or more columns.
10. The method of claim 9, wherein the metadata information describes a data type corresponding to data values in the one or more columns.
11. The method of claim 9, wherein the metadata information describes a data format corresponding to data values in the one or more columns.
12. The method of claim 8, wherein the typeclass is associated with one or more data validations.
13. The method of claim 12, wherein the data validations are automatically applied to validate values stored in the one or more columns.
14. The method of claim 8, wherein the instructions further cause the processors to perform: joining the first database table with at least one second database table based at least in part on the one or more annotated columns.
15. A non-transitory computer readable medium comprising instructions that, when executed, cause one or more processors to perform: causing at least one first table of a first database schema to be migrated to at least one second table of a second database schema; determining a query for modifying the first table during the migration; modifying the second table based at least in part on the query; and updating a mutation table to describe the modification, wherein the mutation table at least describes the modification.
16. The non-transitory computer readable medium of claim 15, wherein the typeclass describes metadata information for the one or more columns.
17. The non-transitory computer readable medium of claim 16, wherein the metadata information describes a data type corresponding to data values in the one or more columns.
18. The non-transitory computer readable medium of claim 16, wherein the metadata information describes a data format corresponding to data values in the one or more columns.
19. The non-transitory computer readable medium of claim 15, wherein the typeclass is associated with one or more data validations.
20. The non-transitory computer readable medium of claim 15, wherein the data validations are automatically applied to validate values stored in the one or more columns.